Prosody-TTS: An End-to-End Speech Synthesis System with Prosody Control

Abstract

End-to-end text-to-speech synthesis systems have achieved immense success in recent times, with improved naturalness and intelligibility. However, end-to-end models, which primarily depend on attention-based alignment, do not offer an explicit provision to modify or incorporate the desired prosody while synthesizing speech. Moreover, state-of-the-art end-to-end systems use autoregressive models for synthesis, making prediction sequential. Hence, the inference time and computational complexity are quite high. This paper proposes Prosody-TTS, a data-efficient end-to-end speech synthesis model that combines the advantages of statistical parametric models and end-to-end neural network models. It also has a provision to modify or incorporate the desired prosody at a finer level by controlling the fundamental frequency ($f_0$) and the phone duration. Generating speech utterances with appropriate prosody and rhythm helps in improving the naturalness of the synthesized speech. We explicitly model the duration of the phoneme and the $f_0$ to have finer control over them during synthesis. The model is trained in an end-to-end fashion to directly generate the speech waveform from the input text, which in turn depends on the auxiliary subtasks of predicting the phoneme duration, $f_0$, and Mel spectrogram. Experiments on the Telugu language data of the IndicTTS database show that the proposed Prosody-TTS model achieves state-of-the-art performance with a mean opinion score of 4.08 and a very low inference time, using just 4 hours of training data.
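To make the idea of explicit duration and $f_0$ control concrete, here is a minimal sketch of how phone-level predictions might be expanded into a frame-level contour and adjusted at synthesis time. It is not the paper's implementation: the function `regulate_length`, its parameters `duration_scale` and `f0_scale`, and all numeric values are hypothetical, chosen only for illustration.

```python
from typing import List, Sequence


def regulate_length(
    phone_f0: Sequence[float],
    phone_frames: Sequence[int],
    duration_scale: float = 1.0,
    f0_scale: float = 1.0,
) -> List[float]:
    """Expand per-phone f0 targets into a frame-level f0 contour.

    Hypothetical helper, not the paper's code: each phone's f0 value is
    repeated for its (scaled) number of frames; unvoiced phones use f0 = 0.
    """
    if len(phone_f0) != len(phone_frames):
        raise ValueError("need exactly one duration per phone")
    contour: List[float] = []
    for f0, frames in zip(phone_f0, phone_frames):
        # Scale the duration, keeping at least one frame per phone.
        n = max(1, round(frames * duration_scale))
        # Scale only voiced values so unvoiced frames stay at zero.
        value = f0 * f0_scale if f0 > 0 else 0.0
        contour.extend([value] * n)
    return contour


# Made-up values for three phones: f0 in Hz, duration in frames.
f0_hz = [120.0, 0.0, 150.0]
frames = [3, 2, 4]

neutral = regulate_length(f0_hz, frames)
slow_high = regulate_length(f0_hz, frames, duration_scale=2.0, f0_scale=1.25)
print(len(neutral), len(slow_high))
```

Because duration and $f_0$ are explicit per-phone quantities in this sketch, slowing the utterance or raising the pitch amounts to rescaling them before the frame-level features are generated, which is the kind of control that attention-based alignment alone does not expose.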

Related resources

Whispered Speech Prosody Modeling for TTS Synthesis

This paper is devoted to modeling prosody of whispered Russian speech. The practical purpose of this research is to extend voice cloning techniques to whispered speech modality. The authors present their analysis of prosodic features that contribute to the expression of sentence type intonation in whispered speech. The current investigation includes intonation contours in complete and incomplet...

Towards End-to-End Prosody Transfer for Expressive Speech Synthesis with Tacotron

We present an extension to the Tacotron speech synthesis architecture that learns a latent embedding space of prosody, derived from a reference acoustic representation containing the desired prosody. We show that conditioning Tacotron on this learned embedding space results in synthesized audio that matches the prosody of the reference signal with fine time detail even when the reference and sy...

Prosody Annotation for Unit Selection TTS Synthesis

This paper concerns prosody annotation and intonation modeling, especially for the application in a corpus based speech synthesis. In order to establish the rules of the automatic intonation modeling, a four hour fully annotated speech database has been acoustically and perceptually analyzed. The speech material included different text types, dialogs and prosodically rich phrases. As the result...

Prosody control in HMM-based speech synthesis

In HMM-based speech synthesis, trained statistical models (context-dependent HMMs) are used to predict duration and generate parameters like mel-cepstral coefficients, log F0 values, and bandpass voicing strengths using the maximum likelihood parameter generation algorithm including global variance (Toda et al, 2007). In the later stages, F0 parameters, bandpass voicing strengths, and the five ...

Journal

Journal title: Circuits, Systems, and Signal Processing

Year: 2022

ISSN: 0278-081X, 1531-5878

DOI: https://doi.org/10.1007/s00034-022-02126-z